unique and duplicated warn on other encodings than UTF8#7379
unique and duplicated warn on other encodings than UTF8#7379ben-schwen wants to merge 8 commits intomasterfrom
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #7379 +/- ##
=======================================
Coverage 99.12% 99.12%
=======================================
Files 85 85
Lines 16637 16640 +3
=======================================
+ Hits 16492 16495 +3
Misses 145 145 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
Let's see revdeps. If none affected then I would keep it like this. |
|
This may bite fread() users who rely on the default value of the
encoding= argument:
``` r
fwrite(list(x = c('a møøse', 'bit my sister once')), 'foo.txt')
fread('foo.txt', sep='\t') |> setkey(x) |> unique() |> _$x |> str()
# chr [1:2] "a møøse" "bit my sister once"
# Warning message:
# In unique.data.table(setkey(fread("foo.txt", sep = "\t"), x)) :
# Mixed encodings detected. Strings were coerced to UTF-8 before
# unique(x).
```
A similar problem may happen to users of old versions of R or old
versions of Windows where non-ASCII string literals would result in
CE_NATIVE strings instead of CE_UTF8.
Should the warning recommend enc2utf8()?
Could someone please share the original context from R-Forge #5758?
|
Basically, it links to https://stackoverflow.com/questions/24085906/unique-data-table-do-not-handle-keys-properly |
|
Late warning about unexpected consequences is better than no warning, so still despite that I see value in it. |
|
Is lower performance meant to be the unexpected consequence? Otherwise
we seem to do the same thing as what R does in identical().
Given how most systems already speak UTF-8 as the native encoding,
changing the default fread(encoding=) argument to "UTF-8" should be
mostly harmless. Only people who read invalid UTF-8 or use a non-UTF-8
system will notice a difference.
|
|
Good points! I've added I have also filed this now as breaking change. |
|
The warning is not caused by mixed encodings. The previous example
sorts a vector of CE_NATIVE strings, which is actually safe to perform
without conversion to UTF-8, but the warning still appears. It would be
more precise to say "Detected strings not marked as ASCII or UTF-8 and
converted them to UTF-8 for detection of duplicates. Please convert
your character columns using enc2utf8() to avoid the on-the-fly
conversion."
Could you please explain why is the result unexpected when calling
duplicated() or unique() without converting strings to UTF-8 first?
After all, identical(enc2utf8('ø'), iconv('ø', to='latin1')) is TRUE,
even if those are different CHARSXPs under the hood.
|
|
No obvious timing issues in HEAD=warn_encodings Generated via commit 2f49a9a Download link for the artifact containing the test results: ↓ atime-results.zip
|
|
I guess what Jan meant by unexptected consequences is that |
|
Simply longer is not a problem, I thought that something that was giving T for == can now start to return F |

Closes #469
Not exactly what Arun suggested but seems like the best option since we encode to UTF8 in
forderv. Is a warning too much here?